Partial least squares discriminant analysis (PLS-DA) is a statistical method that combines the techniques of partial least squares (PLS) regression and linear discriminant analysis (LDA).
PLS-DA works by first finding the linear combinations of the predictor variables (X) that are most correlated with the response variable (Y). These linear combinations are called latent variables (or components). The latent variables are then used to train an LDA model, which can be used to classify new samples.
PLS-DA has several advantages over other classification methods. First, it is relatively robust to collinearity, which is a problem that can occur in data sets with many correlated variables. Second, PLS-DA can handle data sets with a large number of predictor variables. Third, PLS-DA can be used to visualize data, which can be helpful for understanding the relationships between the predictor variables and the response variable.
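To make the mechanics concrete, here is a minimal base-R sketch of the core PLS-DA step described above. The toy data and the single-component SVD shortcut are for illustration only; mixOmics handles scaling, deflation, and multiple components internally.

```r
# Toy sketch of the core PLS-DA step in base R
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)  # 20 samples, 5 features
Y <- factor(rep(c("A", "B"), each = 10))         # two groups

# 1. Dummy-code the class labels as an indicator matrix
Y.dummy <- model.matrix(~ Y - 1)                 # 20 x 2 matrix of 0/1

# 2. Centre X and Y.dummy; the first left singular vector of X'Y gives
#    the feature weights (loadings) for component 1
Xc <- scale(X, center = TRUE, scale = FALSE)
Yc <- scale(Y.dummy, center = TRUE, scale = FALSE)
w  <- svd(crossprod(Xc, Yc))$u[, 1]              # one weight per feature

# 3. The latent variable (component score) is the projection of X onto w
t1 <- Xc %*% w                                   # one score per sample
```

The latent variable `t1` is the quantity that an LDA model (or a distance rule) would then use to separate the groups.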
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
data(breast.TCGA)
# Extract training data and name each data frame
# There are 3 dataframes (3 omics of the same samples), stored as list
# In each dataframe, the rows represent samples.
# Columns represent feature measurements (genes, miRNA, protein)
X <- list(mRNA = breast.TCGA$data.train$mrna,
miRNA = breast.TCGA$data.train$mirna,
protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal Her2 LumA
## 45 30 75
We can use the plsda() function to find combinations of features that discriminate the groups.
The inputs are a matrix of values (sample-rows by feature-columns, X), together with a vector of the groups (Y).
We can also choose the number of components, and whether to add scaling and logratio (as seen in Session 1 with pca()).
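As an aside, the centred log-ratio (CLR) option can be sketched in base R. This toy function illustrates the transform conceptually; mixOmics' own logratio handling may differ in details such as offsets for zero values.

```r
# Centred log-ratio: subtract the log of the geometric mean from each log value
clr <- function(x) log(x) - mean(log(x))

x <- c(10, 20, 70)   # toy compositional values for one sample
clr(x)               # transformed values
sum(clr(x))          # CLR values always sum to (numerically) zero
```

CLR maps compositional data (where only relative abundances are meaningful) onto an unconstrained scale, which is why it is offered alongside scaling.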
plsda.mRNA <- plsda(X$mRNA, Y=Y, ncomp = 3, scale = TRUE, logratio = "CLR")
# plot the result
plotIndiv(plsda.mRNA, ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA",
legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)
plotIndiv() applied to plsda objects projects the samples into the PLS-DA component subspace, components 1 and 2 by default. We can also look at other components, which might reveal further separation of groups in complex datasets.
plotIndiv(plsda.mRNA, comp = c(1,3), ind.names = FALSE, legend=TRUE, title="PLSDA on mRNA",
legend.title = "Cancer\nsubtype", ellipse = TRUE, ellipse.level = 0.6)
PLSDA with a single component may be sufficient for some experiments (e.g. a binary group comparison of healthy vs disease samples), whereas experiments with many groups might be optimally separated using more components.
Each component is a composite of all features, each with a weight coefficient (i.e. its loading). Loadings can be displayed using plotLoadings(), and the most heavily weighted features can be extracted with selectVar().
plotLoadings(plsda.mRNA, comp = 1, contrib = "max", method = "median")
plotLoadings(plsda.mRNA, comp = 2, contrib = "max", method = "median")
loadings.comp1 <- selectVar(plsda.mRNA, comp=1)
loadings.comp2 <- selectVar(plsda.mRNA, comp=2)
head(loadings.comp1$value)
## value.var
## ZNF552 -0.1405038
## SLC43A3 0.1373226
## PREX1 -0.1280330
## FMNL2 0.1274558
## CCNA2 0.1256307
## LMO4 0.1222428
head(loadings.comp2$value)
## value.var
## CDK18 -0.2059362
## TP53INP2 0.2032781
## TRIM45 -0.1901160
## STAT5A -0.1774087
## PLEKHA4 -0.1667928
## ZNF37B -0.1649502
Increasing the number of components improves discrimination up to a point, after which no further value is added.
Similarly, within each component most of the discriminative value is carried by the initial, highly weighted features; adding more features beyond these gives minimal or negative impact (due to noise). It is recommended to optimise the number of components and the number of features in each component using a tuning procedure.
General advice for choosing parameters in the tuning procedure:
The next two parameters determine how the error/success rate is defined, which will depend on the complexity of the data:
test.keepX <- 2^(1:7)  # candidate numbers of features to keep per component
test.keepX
## [1] 2 4 8 16 32 64 128
set.seed(123)
tune.splsda.mRNA <- tune.splsda(X$mRNA, Y = Y, ncomp = 3, validation = "Mfold",
folds = 5, test.keepX = test.keepX,
dist = "centroids.dist",
scale = TRUE, logratio = "CLR",
nrepeat = 30, cpus = 3, progressBar = FALSE,
measure = "BER")
plot(tune.splsda.mRNA)
Inspect the performance (balanced error rate, BER), which we want to minimise, as the number of components and the number of selected features increase.
Based on the tuning results, let's try a slightly simplified sparse PLS-DA model, using two components with 32 and 16 genes, respectively.
tuned.splsda.mRNA <- splsda(X$mRNA, Y = Y, ncomp = 2, keepX = c(32, 16))
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 32 + 16 genes",
ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6,
legend.title = "Cancer\nSubtype", legend = TRUE)
# we can re-plot with a background-shading of the group boundary calls
background = background.predict(tuned.splsda.mRNA, comp.predicted=2, dist = "centroids.dist")
plotIndiv(tuned.splsda.mRNA, title="Tuned sPLSDA using 32 + 16 genes",
ind.names = FALSE, ellipse = TRUE, ellipse.level = 0.6,
legend.title = "Cancer\nSubtype", legend = TRUE,
background = background)
The separation between groups looks as good or better than the initial PLSDA model, and requires fewer features.
We can inspect the top-ranked chosen genes and their loadings.
plotLoadings(tuned.splsda.mRNA, comp = 1, contrib = "max", method = "median", size.name = 0.5)
plotLoadings(tuned.splsda.mRNA, comp = 2, contrib = "max", method = "median")
loadings.comp1 <- selectVar(tuned.splsda.mRNA, comp=1)
loadings.comp2 <- selectVar(tuned.splsda.mRNA, comp=2)
head(loadings.comp1$value)
## value.var
## ZNF552 -0.4169182
## KDM4B -0.3691667
## PREX1 -0.3014427
## LRIG1 -0.2998513
## CCNA2 0.2802598
## TTC39A -0.2671062
head(loadings.comp2$value)
## value.var
## TP53INP2 0.5919659
## CDK18 -0.5623317
## NDRG2 -0.2928047
## TRIM45 -0.2842161
## STAT5A -0.2480526
## PLEKHA4 -0.2210436
Stability of the selected features can be reviewed with the perf() function, which repeatedly cross-validates the model and reports classification error rates together with feature stability.
These results indicate how frequently each feature is selected when the procedure is repeated across cross-validation folds.
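The idea behind the stability measure can be illustrated in base R with a generic stand-in selector (highest variance takes the place of sPLS-DA selection here, purely for illustration): subsample the data repeatedly, re-select the top features each time, and count how often each feature is chosen.

```r
# Generic illustration of feature-selection "stability"
set.seed(42)
X <- matrix(rnorm(50 * 10), nrow = 50,
            dimnames = list(NULL, paste0("gene", 1:10)))
X[, "gene1"] <- X[, "gene1"] * 3      # make one feature reliably selectable

picks <- replicate(100, {
  idx <- sample(nrow(X), 40)          # random 80% subsample
  # select the top 3 features by variance (stand-in for sPLS-DA selection)
  names(sort(apply(X[idx, ], 2, var), decreasing = TRUE))[1:3]
})
stability <- sort(table(picks), decreasing = TRUE) / 100
stability                             # a stable feature is picked in nearly every repeat
```

A feature that is genuinely informative should be re-selected in most subsamples; features selected only occasionally are more likely to reflect noise.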
cim() is a plotting function in mixOmics, which can generate a basic heat map of expression (or other values, depending on context). cim is an abbreviation of Clustered Image Map.
Note that saving the output (large image) as a file is recommended, rather than viewing in RStudio directly. Alternatively, you can open a new window for graphing with X11().
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA")
We can make the heatmap more informative by adding a legend and colour-coding the sample rows by cancer subtype:
legend=list(legend = levels(Y), # set of classes
col = unique(color.mixo(Y)), # set of colours
title = "Cancer Subtype", # legend title
cex = 0.7) # legend size
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp1", comp = 1,
row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE, ylab = "Samples")
cim(tuned.splsda.mRNA, save = "png", name.save = "splsda.mRNA.comp2", comp = 2,
row.sideColors = color.mixo(Y), legend=legend, row.names = FALSE, ylab = "Samples")
sPLS-DA models can be useful to find a good combination of distinctive features that are associated with the experimental groupings. These models can also be used to make predictions, classifying a new sample.
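The "centroids.dist" rule used for prediction can be sketched in base R: each class gets a centroid in component space, and a new sample is assigned to the nearest one. This toy function and its made-up scores are for illustration; mixOmics computes the distances in the fitted model's component space.

```r
# Base-R sketch of nearest-centroid classification
centroid_classify <- function(train.scores, train.labels, new.scores) {
  # class centroids: mean component score per class (rows = classes)
  centroids <- apply(train.scores, 2, tapply, train.labels, mean)
  # Euclidean distances between every new sample and every centroid
  d <- as.matrix(dist(rbind(centroids, new.scores)))
  d.new <- d[-(1:nrow(centroids)), 1:nrow(centroids), drop = FALSE]
  rownames(centroids)[apply(d.new, 1, which.min)]
}

scores <- rbind(c(1, 1), c(1.2, 0.8), c(-1, -1), c(-0.8, -1.2))
labels <- c("A", "A", "B", "B")
centroid_classify(scores, labels, rbind(c(0.9, 1.1), c(-1.1, -0.9)))
# "A" "B"
```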
# Prepare test set data
data.test.mRNA <- breast.TCGA$data.test$mrna
Y.test <- breast.TCGA$data.test$subtype
table(Y.test)
## Y.test
## Basal Her2 LumA
## 21 14 35
predict.mRNA <- predict(tuned.splsda.mRNA, newdata = data.test.mRNA,
dist = "centroids.dist")
predict.mRNA # List the different outputs
##
## Call:
## predict.mixo_spls(object = tuned.splsda.mRNA, newdata = data.test.mRNA, dist = "centroids.dist")
##
## Main numerical outputs:
## --------------------
## Prediction values of the test samples for each component: see object$predict
## variates of the test samples: see object$variates
## Classification of the test samples for each distance and for each component: see object$class
head(predict.mRNA$predict)
## , , dim1
##
## Basal Her2 LumA
## A54N 0.8146634 0.2570645 -0.07172794
## A2NL 0.9710992 0.2744097 -0.24550891
## A6VY 0.7322610 0.2479279 0.01981105
## A3XT 0.8082378 0.2563520 -0.06458983
## A1F0 0.6502978 0.2388401 0.11086215
## A26I 0.7176443 0.2463073 0.03604847
##
## , , dim2
##
## Basal Her2 LumA
## A54N 1.0458605 -0.16729274 0.12143224
## A2NL 1.0783905 0.07747889 -0.15586937
## A6VY 0.8089302 0.10720334 0.08386649
## A3XT 1.0456035 -0.17932769 0.13372416
## A1F0 0.8087518 -0.05199884 0.24324707
## A26I 0.6536870 0.36369948 -0.01738646
confusion.mat.mRNA <- get.confusion_matrix(truth = Y.test,
predicted = predict.mRNA$class$centroids.dist[,2])
confusion.mat.mRNA
## predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal 20 1 0
## Her2 0 14 0
## LumA 0 1 34
get.BER(confusion.mat.mRNA)
## [1] 0.02539683
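The balanced error rate averages the per-class error rates, so it is easy to verify by hand from the confusion matrix above: Basal misclassifies 1 of 21, Her2 0 of 14, and LumA 1 of 35.

```r
# Manual BER check from the confusion matrix above (rows = truth)
conf <- matrix(c(20,  1,  0,
                  0, 14,  0,
                  0,  1, 34),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("Basal", "Her2", "LumA"),
                               c("Basal", "Her2", "LumA")))
per.class.error <- 1 - diag(conf) / rowSums(conf)  # error rate per true class
mean(per.class.error)                              # 0.02539683, matching get.BER()
```

Averaging per-class errors (rather than pooling all samples) prevents the large LumA class from dominating the metric, which is why BER is preferred for unbalanced designs like this one.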